home *** CD-ROM | disk | FTP | other *** search
- Path: news.halcyon.com!usenet
- From: normanb@halcyon.com (Norm Bryar)
- Newsgroups: comp.lang.c++
- Subject: Re: I need help with a data structure
- Date: Fri, 05 Apr 1996 16:23:33 GMT
- Organization: Northwest Nexus Inc.
- Message-ID: <4k3hcr$230@news.halcyon.com>
- References: <4k2fe8$cb9@newsource.ihug.co.nz>
- NNTP-Posting-Host: blv-pm2-ip12.halcyon.com
- X-Newsreader: Forte Free Agent 1.0.82
-
- hoggy@ihug.co.nz (John Hogg) wrote:
-
- >Hi, I'm quite new to programming so I would like some advice on a program
- >I'm going to write in C++.
-
- >I have a number of chapters for a book, stored as text files (ie. all
- >ASCII characters). I need to 'read' through these chapters, then store
- >and display the new words that occur in each chapter so that a glossary
- >can be made for the key words. I also have a text file that contains
- >common words that should be omitted (ie. "a", "the" etc). There are no
- >more than 1,000,000 words in each chapter and I estimate there will be
- >no more than 30,000 different words.
-
- >I have decided I should hash the words from the common word file into an
- >array. Then, each word I read in from the chapters could be checked
- >against this array and if it is not there I would hash it into another
- >array of 30,000 items, discarding it if it is already there. Then this
- >new word could be added to a binary tree to sort the words into
- >alphabetical order for storage and display purposes.
-
- >There is the added problem of words such as 'sweating' and æsweatÆ where
- >the word æsweatÆ would only be required. I could simply check if a word
- >has 'ing' on the end and remove it but then 'dancing' would become 'danc'.
- >Maybe it would be quicker to display all the words and get the user to
- >delete the words not required.
-
- >I would appreciate any help you can give me.
-
- I thnk you're on the right track. You may want a double-hash of the
- words you're collecting: this is a 3-D array, kind of, where you hash
- the word with one algorithm and, instead of finding a list, you find
- another hash-table. You hash the word with this alternate algorithm
- and then find your list. I think your hash searches may turn out
- faster if your number of buckets is square-rooted this way.
-
- The 'ing' problem is known as word-stemming. Perhaps some archie or
- web search for full-text-search or word-stemming will help you.
- Hopefully you can find the established rules that will tell you "if
- 'ing' preceeded by 'c', replace with 'e'" or something.
-
- Good luck.
- --Norm
-
-